Controlling Software Pipelining

Software pipelining (SWP) is an important optimization for the inner loops of programs: by rearranging a loop to overlap calculations from multiple iterations, it can improve performance dramatically. Pipelining is an iterative process that searches for an effective schedule, then for a workable allocation of registers, and retries if either step fails. Some of the options in the -SWP group control that process. Other options control how the loop body is prepared for the attempt, for example, by unrolling.

Many important loop preparation transformations involve reassociation of floating point values. See the discussion of floating point optimization above, especially the "-OPT:roundoff=n" option.

SWP must normally be careful during the initial and final iterations of a loop to not perform extra operations that may cause run-time traps. It must be similarly careful if early exits from a loop (that is, before the initially calculated trip count is reached) are possible. Turning off certain traps at run time can give it more flexibility, producing better schedules and/or simpler wind-up/wind-down code. See the target environment option -TENV:X=n for general control over the exception environment.

-SWP:=(ON|OFF)

Enable/disable SWP (normally enabled at -O3).

-SWP:back_substitution[=(ON|OFF)]

The iteration interval of a pipelined loop, that is, the number of cycles between the starts of successive iterations, is constrained by circular data dependencies across iterations, called recurrences. This option, ON by default, allows transformations that make recurrences less severe by substituting the expression that defines a variable for the variable. For example, consider the code:

DO i=1,n
   a(i) = a(i-1) + 5.0
END DO

Without back-substitution, each iteration must wait for the previous iteration's add to complete, yielding a best-case iteration interval of 4 cycles per iteration on the R8000. Back-substitution can transform the loop to code equivalent to:

DO i=1,n
   a(i) = a(i-8) + 40.0
END DO

With appropriate initialization, this version can achieve an effective iteration interval of nearly 0.5 cycles.
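
The equivalence of the two loops above can be checked with a small C sketch. It is an illustration only, not compiler output: the function names are hypothetical, the arrays are 0-based, and the first eight elements are filled by the original recurrence to stand in for the "appropriate initialization".

```c
#include <stddef.h>

/* Original recurrence: each element depends on the one just before it.
   The caller must set a[0] before the call. */
void recurrence_direct(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + 5.0;
}

/* Back-substituted form (hypothetical hand-written equivalent).
   Elements 1..7 are initialized with the original recurrence; after
   that, each element depends on the one 8 positions back, so eight
   independent chains can be in flight at once. */
void recurrence_back_substituted(double *a, size_t n)
{
    size_t i;
    for (i = 1; i < n && i < 8; i++)
        a[i] = a[i - 1] + 5.0;      /* initialization */
    for (; i < n; i++)
        a[i] = a[i - 8] + 40.0;     /* 8 * 5.0 folded into one add */
}
```

With a[0] = 1.0 both functions fill the array identically, because every intermediate value is exactly representable. In general the two forms can round differently, which is why transformations like this depend on the -OPT:roundoff setting. The dependence distance of 8 is what lets the recurrence bound drop from 4 cycles to 4/8 = 0.5 cycles per iteration.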

-SWP:ii_backtracks_max=n

This option limits how many times SWP backtracks while searching for a schedule with ii (iteration interval) cycles before it gives up and increases ii. Increasing the limit improves the chances of finding a schedule at the current ii; decreasing it may reduce compilation time.
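
The interaction between the backtrack limit and ii can be pictured with a toy model in C. This is a sketch of the control flow described above, not the compiler's actual algorithm: search_ii, the callback type, and toy_try_schedule (which simply pretends that a schedule at interval ii needs 60/ii backtracks) are all hypothetical names and assumptions introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true if a schedule with interval ii
   was found using no more than backtracks_max backtracks. */
typedef bool (*try_schedule_fn)(int ii, int backtracks_max);

/* Try each ii in turn, starting from the smallest. When the attempt
   at one ii exhausts its backtrack budget, give up and increase ii.
   Returns the first ii that works, or -1 if none does. */
int search_ii(int min_ii, int max_ii, int backtracks_max,
              try_schedule_fn try_schedule)
{
    for (int ii = min_ii; ii <= max_ii; ii++) {
        if (try_schedule(ii, backtracks_max))
            return ii;
    }
    return -1;
}

/* Toy stand-in for a real scheduler (an assumption, not real
   behavior): tighter intervals need more backtracking to schedule. */
bool toy_try_schedule(int ii, int backtracks_max)
{
    return 60 / ii <= backtracks_max;
}
```

In this toy model, search_ii(4, 16, 10, toy_try_schedule) settles on ii = 6, while raising the budget to 15 succeeds at ii = 4: a larger limit costs more search effort but can yield a tighter schedule.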

-SWP:body_ins_count_max=n

SWP is not attempted for loop bodies containing more than n instructions (the default is 100, 0 for no limit). Larger loop bodies may be successfully pipelined, but take more compilation time in the attempt, so there may be a tradeoff of code improvement vs. compile time.

Loop bodies are normally unrolled in preparation for SWP. This option also limits the unrolling, since loops are not unrolled to more than n instructions in the unrolled body. Unrolling is also constrained by the unroll_times_max option described below. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)

For example, compile tomcatv (a SPEC benchmark) with:

f77 -S -64 -O3 -mips4 -OPT:IEEE_arith=3:ro=3 -LIST:=ON tomcatv.f

A listing file tomcatv.L is produced. The results for the loop at line 32 look like this:

Compiling tomcatv.f (tomcatv.f)
Options:
  -O3   (Optimization level)
  -g0   (Debug level)
  -m1   (Report warnings)

  -TARG:        (Target group)
    abi=64                      (64-bit ABI)
    isa=mips4                   (Instruction Set Architecture)
    processor=R8000
    madd=ON             (Allow madd instructions)

  -TENV:        (Target environment group)
    PIC=ON                      (Shared code)
    small_GOT                   (Assume GOT < 64KB)
    no_page_offset=OFF          (Use page/offset addressing)
    short_data=8                (Size limit for short data objects)
    short_literals=8            (Size limit for short literals)
    misalignment=0              (Misaligned data model)
    align_aggregates=8          (Forced aggregate alignment)
    use_fp=OFF                  (Force frame pointer use)
    varargs_prototypes=ON       (Require prototypes for varargs routines)
    X=1                         (Exception suppression model)

  -OPT:         (Optimization group)
    div_split=OFF               (Use a*(1/b) for a/b)
    fast_complex=OFF            (Use fast complex norm/sqrt)
    fast_exp=OFF                (Use fast exp algorithm)
    fast_sqrt=OFF               (Use 1/rsqrt(x) for sqrt(x))
    fast_io=OFF                 (Use fast I/O intrinsics)
    fold_aggressive=OFF         (Use aggressive expression folding)
    IEEE_arithmetic=3           (Level of IEEE-754 compliance)
    IEEE_comparisons=OFF        (Don't eliminate comparisons like x==x)
    roundoff=3                  (Level of roundoff errors allowed)
    space=OFF                   (Optimize code space over execution time)
    vector_intrinsics=OFF       (Use vector intrinsics)

  -LIST:        (Listing group)
    =ON                         (Produce listing file)
    file=tomcatv.L              (Listing file name)
    performance=ON              (List performance information)
    source=OFF                  (List source code)
    symbols=OFF                 (List symbol table)
...
#<swps> Pipelined loop line 32 steady state
#<swps>
#<swps>     Not unrolled before pipelining
#<swps>     4 cycles per iteration
#<swps>     1 flop         (  6% of peak) (madds count as 2)
#<swps>     1 flop         ( 12% of peak) (madds count as 1)
#<swps>     0 madds        (  0% of peak)
#<swps>     8 mem refs     (100% of peak)
#<swps>     3 integer ops  ( 37% of peak)
#<swps>    12 instructions ( 75% of peak)
#<swps>     4 short trip threshold
#<swps>
#<swps>     4 possible stall cycles
#<swps>     4 min possible stall cycles
#<swps>
#<swps>

If you add the switch -SWP:body_ins=250, the amount of unrolling increases. The loop at line 32 becomes:

#<swps> Pipelined loop line 32 steady state
#<swps>
#<swps>   2 unrollings before pipelining
#<swps>   7 cycles per 2 iterations
#<swps>   2 flops       (  7% of peak) (madds count as 2)
#<swps>   2 flops       ( 14% of peak) (madds count as 1)
#<swps>   0 madds       (  0% of peak)
#<swps>  14 mem refs    (100% of peak)
#<swps>   3 integer ops ( 21% of peak)
#<swps>  19 instructions( 67% of peak)
#<swps>   2 short trip threshold
#<swps>
#<swps>   6 possible stall cycles
#<swps>   4 min possible stall cycles
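
The percentages in the two listings can be reproduced with a little arithmetic. The C sketch below assumes per-cycle peak rates inferred from the listings themselves (4 instructions, 2 memory references, 2 integer operations, and 4 or 2 flops depending on whether madds count as 2 or 1). These rates, the truncation to a whole percent, and the function name are assumptions made for illustration, not values taken from a processor manual.

```c
/* Percentage of peak for one resource, truncated to a whole percent.
   count          operations of this kind in the steady-state loop
   cycles         cycles in the steady-state loop
   peak_per_cycle assumed maximum operations of this kind per cycle */
int percent_of_peak(int count, int cycles, int peak_per_cycle)
{
    return (100 * count) / (cycles * peak_per_cycle);
}
```

For the first listing, percent_of_peak(12, 4, 4) gives 75 for instructions and percent_of_peak(3, 4, 2) gives 37 for integer operations; for the second, percent_of_peak(19, 7, 4) gives 67. Per iteration, the unrolled loop takes 7/2 = 3.5 cycles instead of 4, and in both cases memory references are the saturated resource.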

-SWP:fix_recurrences[=(ON|OFF)]

This option controls both of the transformations controlled by back_substitution and interleave_reductions. See their descriptions.

-SWP:if_conversion[=(ON|OFF)]

SWP generally works much better on loop bodies without internal branches caused by conditional execution. This option causes conditional branches to be removed when possible by using conditional move instructions (MIPS IV) and equivalents. For example, consider the code:

DO i=1,n
   IF ( a(i) .LT. b(i) ) THEN
      c(i) = a(i)
   ELSE
      c(i) = b(i)
   END IF
END DO

The loop body is compiled for MIPS IV as:

ldc1 $f0,a(i)
ldc1 $f1,b(i)
c.lt.s cc,$f0,$f1
movf.s $f0,$f1,cc
sdc1 $f0,c(i)

Note that no conditional branches occur in the code. This option is ON by default for MIPS IV targets only.
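
The same idea can be expressed at the source level. The following C sketch (hypothetical function names, 0-based arrays) shows the branching loop next to a hand-written form that mirrors the instruction sequence above: load both values, compute a condition, select, store. Whether a given compiler actually emits a conditional move for either form is not guaranteed; the sketch only illustrates the shape of the transformation.

```c
#include <stddef.h>

/* Branching form: control flow inside the loop body. */
void select_min_branch(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] < b[i])
            c[i] = a[i];
        else
            c[i] = b[i];
    }
}

/* If-converted form: straight-line body with a select. */
void select_min_converted(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x = a[i];        /* load a(i) */
        float y = b[i];        /* load b(i) */
        int cc = x < y;        /* compare, set condition */
        x = cc ? x : y;        /* keep x if cc is true, else take y */
        c[i] = x;              /* store c(i) */
    }
}
```

Both functions produce the same result for every input, so the second form can replace the first without changing program behavior.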

-SWP:interleave_reductions[=(ON|OFF)]

This option, ON by default, has the same motivation as back-substitution. It allows transformations that make recurrences arising from reductions less severe by interleaving multiple threads of the reduction and then piecing them together at the end of the loop. For example, consider the code to sum an array:

DO i=1,n
   sum = sum + a(i)
END DO

Without interleaving, each iteration must wait for the previous iteration's add to complete, yielding a best case iteration interval of 4 cycles/iteration on the R8000. Interleaving can transform the loop to something equivalent to:

DO i=1,n,8
   sum1 = sum1 + a(i)
   sum2 = sum2 + a(i+1)
   sum3 = sum3 + a(i+2)
   sum4 = sum4 + a(i+3)
   sum5 = sum5 + a(i+4)
   sum6 = sum6 + a(i+5)
   sum7 = sum7 + a(i+6)
   sum8 = sum8 + a(i+7)
END DO

sum = sum1+sum2+sum3+sum4+sum5+sum6+sum7+sum8

This version can achieve an effective iteration interval of nearly 0.5 cycles. Because these transformations reassociate floating-point operations, they generally require -OPT:roundoff=2 or higher.
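
A complete version of the interleaved reduction needs zero-initialized partial sums and a cleanup loop for trip counts that are not a multiple of 8. The C sketch below fills in those details under stated assumptions: the function names are hypothetical, arrays are 0-based, and the order in which the partial sums are combined is one arbitrary choice among several.

```c
#include <stddef.h>

/* Straightforward reduction: one long dependence chain. */
double sum_simple(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum = sum + a[i];
    return sum;
}

/* Interleaved reduction: eight independent partial sums, pieced
   together after the loop. */
double sum_interleaved(const double *a, size_t n)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    double s5 = 0.0, s6 = 0.0, s7 = 0.0, s8 = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        s1 += a[i];
        s2 += a[i + 1];
        s3 += a[i + 2];
        s4 += a[i + 3];
        s5 += a[i + 4];
        s6 += a[i + 5];
        s7 += a[i + 6];
        s8 += a[i + 7];
    }
    for (; i < n; i++)      /* cleanup for the last n mod 8 elements */
        s1 += a[i];

    return s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8;
}
```

The two functions add the same numbers in a different order. For data such as small integers stored as doubles the results match exactly; for arbitrary floating-point data they can differ in the last bits, which is the reason the transformation is tied to the roundoff level.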

-SWP:trip_count_min=n

SWP is not attempted for loops with trip counts known to be smaller than n (the default is 5). When the trip count is not known at compile time, the limit is applied via a run-time test. A long loop body can sometimes be pipelined profitably even at a small trip count; lowering n with this option allows that.

-SWP:unroll_times_max=n

This option controls the maximum number of times inner loop bodies are unrolled before attempting pipelining. The default is 2 for MIPS IV, and 1 for MIPS I-III. Unrolling is also constrained by the body_ins_count_max option described above. (Unrolling of loop bodies not expected to be software pipelined is controlled separately by -OPT:unroll_size and -OPT:unroll_times_max.)
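
As a reminder of what unrolling does to a loop body before pipelining is attempted, here is a minimal C sketch of a loop unrolled by a factor of 2, with a cleanup iteration for odd trip counts. The loop, the function names, and the way the remainder is handled are assumptions chosen for illustration; the compiler's own unrolled code may be organized differently.

```c
#include <stddef.h>

/* Original loop: y(i) = y(i) + alpha * x(i). */
void axpy_simple(double *y, const double *x, double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Unrolled by 2: each pass through the body does the work of two
   original iterations, giving the scheduler more independent
   operations to overlap. */
void axpy_unrolled2(double *y, const double *x, double alpha, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
    }
    for (; i < n; i++)      /* at most one leftover iteration */
        y[i] += alpha * x[i];
}
```

The unrolled body contains roughly twice as many instructions as the original, which is why unrolling is bounded both by unroll_times_max and by body_ins_count_max.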

The -S option to the compiler provides information about software pipelining, the loops it works on, and the results. See "General Options for Compiler Drivers" for more information.